Updating zero capacity resource semantics by romilbhardwaj · Pull Request #4555 · ray-project/ray

romilbhardwaj · 2019-04-04T02:20:00Z

What do these changes do?

This PR updates the resource quantity semantics. From this PR onwards, a resource with capacity 0 is treated as a resource that does not exist, so no data structure may store a resource with capacity zero.
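As a rough illustration of the rule above, here is a minimal Python sketch of a resource table in which setting a capacity of zero is the same as deleting the entry. The `ResourceMap` class and its method names are hypothetical and are not taken from the Ray codebase.

```python
class ResourceMap:
    """Toy resource table in which zero capacity means "absent" (hypothetical)."""

    def __init__(self):
        self._capacity = {}

    def set_capacity(self, name, capacity):
        if capacity < 0:
            raise ValueError("capacity must be non-negative")
        if capacity == 0:
            # Zero capacity is equivalent to deletion: never store the entry.
            self._capacity.pop(name, None)
        else:
            self._capacity[name] = capacity

    def as_dict(self):
        # Return a copy so callers cannot insert zero entries behind our back.
        return dict(self._capacity)


resources = ResourceMap()
resources.set_capacity("CPU", 4)
resources.set_capacity("GPU", 0)  # never stored
resources.set_capacity("CPU", 0)  # removes the existing entry
print(resources.as_dict())  # {}
```

Under this convention, a lookup for a missing key and a lookup for a zero-capacity resource are indistinguishable, which is the point of the change.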

Linter

I've run scripts/format.sh to lint the changes in this PR.

romilbhardwaj · 2019-04-04T02:22:34Z

src/ray/raylet/scheduling_resources.cc

Should we be checking for float precision errors here? Something like RAY_CHECK(capacity > 0 - std::numeric_limits<double>::epsilon())

I think either way is fine. If it's not causing any issues right now, then we can leave it. @williamma12 is fixing this properly in a separate PR.
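To show why a tolerance might matter, the sketch below reproduces the idea of the suggested check in Python rather than in the C++ of scheduling_resources.cc. The helper name `passes_capacity_check` is hypothetical, and using machine epsilon as the tolerance simply mirrors the comment above; it is not a claim about what the follow-up PR does.

```python
import sys

# Machine epsilon for Python floats (IEEE 754 doubles on common platforms).
EPSILON = sys.float_info.epsilon


def passes_capacity_check(capacity, tolerance=EPSILON):
    """Mirror of the suggested check: capacity > 0 - epsilon (hypothetical helper)."""
    return capacity > 0 - tolerance


# Repeated subtraction leaves a tiny non-zero residue instead of exactly 0.0.
remaining = 0.3 - 0.1 - 0.1 - 0.1

print(remaining == 0.0)                  # False: rounding error left a residue
print(remaining >= 0.0)                  # False: the residue is slightly negative
print(passes_capacity_check(remaining))  # True: within the tolerance
```

A strict `>= 0` check would reject this value even though, logically, the resource has been fully and exactly consumed.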

AmplabJenkins · 2019-04-04T02:25:44Z

Can one of the admins verify this patch?

AmplabJenkins · 2019-04-04T05:09:30Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13487/

AmplabJenkins · 2019-04-04T08:46:08Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13504/

raulchen

Thanks for this PR. Looks good to me overall. But @robertnishihara knows better about this code. I'll let him give the approval.

python/ray/services.py

AmplabJenkins · 2019-04-04T21:20:28Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13541/

AmplabJenkins · 2019-04-05T00:09:37Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13555/

AmplabJenkins · 2019-04-05T00:09:51Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13566/

AmplabJenkins · 2019-04-05T00:09:55Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13568/

AmplabJenkins · 2019-04-06T23:51:31Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13600/

AmplabJenkins · 2019-04-07T19:57:30Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13620/

AmplabJenkins · 2019-04-08T06:01:01Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13632/

romilbhardwaj · 2019-04-08T06:58:14Z

jenkins retest this please

AmplabJenkins · 2019-04-08T09:19:12Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13634/

python/ray/tests/test_basic.py

romilbhardwaj · 2019-04-09T01:02:13Z

python/ray/tune/ray_trial_executor.py

@richardliaw We're changing resource semantics: zero capacity resources are no longer included in resource data structures. For instance, instead of returning {resource: 0} we now return an empty dictionary.

To make tests work, I've made some changes to resource handling in tune - can you please check if this looks okay?

I pushed a small tweak - thanks for letting me know!

AmplabJenkins · 2019-04-09T01:19:15Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13644/

romilbhardwaj · 2019-04-09T01:25:25Z

jenkins retest this please

AmplabJenkins · 2019-04-09T03:34:15Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13646/

AmplabJenkins · 2019-04-09T03:56:05Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13647/

robertnishihara · 2019-04-09T06:39:05Z

@richardliaw @romilbhardwaj looks like the Jenkins tune test is still failing. resource can be {}, so we are hitting the if not resources: raise(...) code path.
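The failure mode described here comes from an empty dict being falsy in Python. The sketch below contrasts a truthiness check with an explicit `None` check; both function names are hypothetical and neither is copied from tune.

```python
def require_resources_truthy(resources):
    """Rejects None, but also rejects a legitimate empty dict."""
    if not resources:
        raise ValueError("no resources available")
    return resources


def require_resources_not_none(resources):
    """Rejects only a missing value; {} is accepted as 'nothing available'."""
    if resources is None:
        raise ValueError("no resources available")
    return resources


empty = {}  # what is returned now that zero-capacity entries are dropped

try:
    require_resources_truthy(empty)
    truthy_check_raised = False
except ValueError:
    truthy_check_raised = True

print(truthy_check_raised)                # True
print(require_resources_not_none(empty))  # {}
```

Whether `{}` should be accepted at that call site is a design decision for the tune code; the sketch only shows why the existing check started firing.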

AmplabJenkins · 2019-04-09T10:02:58Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13653/

robertnishihara · 2019-04-10T06:57:30Z

python/ray/tests/test_basic.py

This is probably fine, but 5 retries may not be enough in an environment like Travis, so let's keep an eye on whether this test becomes flaky or not and if it starts failing then increase this.

AmplabJenkins · 2019-04-10T08:54:09Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13687/

AmplabJenkins · 2019-04-11T08:11:15Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13724/

AmplabJenkins · 2019-04-11T09:43:14Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13725/

AmplabJenkins · 2019-04-12T02:03:26Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13744/

AmplabJenkins · 2019-04-12T06:11:46Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13756/

AmplabJenkins · 2019-04-12T06:46:36Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13757/

AmplabJenkins · 2019-04-12T06:53:58Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13758/

raulchen · 2019-04-12T08:41:54Z

java/api/src/main/java/org/ray/api/options/BaseTaskOptions.java

  public BaseTaskOptions(Map<String, Double> resources) {
+    for (Map.Entry<String, Double> entry : resources.entrySet()) {
+      if (entry.getValue().compareTo(0.0) <= 0) {
+        throw new RuntimeException(String.format("Resource capacity should be positive, " +


Let's use IllegalArgumentException here

raulchen · 2019-04-12T08:42:31Z

java/test/src/main/java/org/ray/api/test/ResourcesManagementTest.java

+      CallOptions callOptions3 = new CallOptions(ImmutableMap.of("CPU", 0.0));
+      Assert.fail();
+    } catch (RuntimeException e) {
+      // We should receive a RuntimeException indicate that we should pass a zero capacity resource.


Suggested change

-      // We should receive a RuntimeException indicate that we should pass a zero capacity resource.
+      // We should receive a RuntimeException indicates that we should not pass a zero capacity resource.

AmplabJenkins · 2019-04-12T12:16:22Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13766/

ls-daniel · 2019-04-17T16:11:16Z

I believe that this commit may be breaking autoscaling.

With a head node with num-cpus=1, this commit results in the autoscaler LoadMetrics reporting an empty dict for both static and dynamic resources once 1 job is scheduled on the head node.

This means that the head node never scales up.
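The symptom can be illustrated with a small, purely hypothetical utilization calculation. This is not the autoscaler's LoadMetrics code; the `utilization` function and the dict shapes are assumptions made for the sketch. It shows that a dropped zero entry in the available map is harmless on its own, but that an empty static map leaves nothing to report load against.

```python
def utilization(static, available):
    """Fraction in use per resource; a key missing from `available` means 0 left.

    Hypothetical helper, not part of Ray.
    """
    return {
        name: (total - available.get(name, 0.0)) / total
        for name, total in static.items()
    }


# One CPU in total, fully used, so the zero entry was dropped from `available`.
print(utilization({"CPU": 1.0}, {}))  # {'CPU': 1.0}

# If the static map is also reported as empty, no load is visible at all.
print(utilization({}, {}))  # {}
```

In the second case a consumer sees no resources and no load, which matches the report that the cluster never scales up.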

robertnishihara · 2019-04-17T21:58:45Z

@ls-daniel thanks for reporting this. We'll look into it.

romilbhardwaj commented Apr 4, 2019

romilbhardwaj force-pushed the scheduling-res-zerocap branch from 2aa7e80 to b29bbcc Compare April 4, 2019 06:10

robertnishihara mentioned this pull request Apr 4, 2019

Fix broken pipe callback #4513

Merged

raulchen reviewed Apr 4, 2019

python/ray/services.py (outdated, resolved)

robertnishihara reviewed Apr 9, 2019

python/ray/tests/test_basic.py (outdated, resolved)

robertnishihara reviewed Apr 9, 2019

python/ray/tests/test_basic.py (resolved)

williamma12 mentioned this pull request Apr 9, 2019

Change resource bookkeeping to account for machine precision. #4533

Merged

1 task

romilbhardwaj commented Apr 9, 2019

romilbhardwaj force-pushed the scheduling-res-zerocap branch from 3fe4ff5 to 04d23b6 Compare April 10, 2019 06:33

robertnishihara reviewed Apr 10, 2019

Updates make zero capacity equivalent to deletion.

1052cce

Fix typo More fixes. Updates to python functions debug statements Rounding error fixes, removing cpu addition in cython and test fixes. linting Fix worker pool test python linting

Remove zero capacity resources from Java API

35539a2

Add an exception for zero capacity resource in Java worker.

74f647a

jovany-wang force-pushed the scheduling-res-zerocap branch from 79d46e0 to 74f647a Compare April 12, 2019 04:17

raulchen reviewed Apr 12, 2019

address comments

2c30bc9

robertnishihara approved these changes Apr 12, 2019

robertnishihara merged commit 0f42f87 into ray-project:master Apr 12, 2019

romilbhardwaj deleted the scheduling-res-zerocap branch April 13, 2019 00:05

romilbhardwaj restored the scheduling-res-zerocap branch April 17, 2019 22:36

romilbhardwaj added a commit to romilbhardwaj/ray that referenced this pull request Apr 17, 2019

Autoscaler hotfix for ray-project#4555.

7cd1dce

romilbhardwaj mentioned this pull request Apr 18, 2019

Autoscaler hotfix for #4555. #4653

Merged

1 task

ls-daniel added a commit to longshotsyndicate/ray that referenced this pull request Apr 18, 2019

Revert "Updating zero capacity resource semantics (ray-project#4555)"

de9fb22

This reverts commit 0f42f87.

devin-petersohn added a commit that referenced this pull request Apr 18, 2019

Revert "Updating zero capacity resource semantics (#4555)"

618147f

This reverts commit 0f42f87.

ls-daniel pushed a commit to longshotsyndicate/ray that referenced this pull request May 7, 2019

Autoscaler hotfix for ray-project#4555.

fde019c

robertnishihara pushed a commit that referenced this pull request May 8, 2019

Autoscaler hotfix for #4555. (#4653)

0421cba

robertnishihara mentioned this pull request Jun 8, 2019

Fix resource bookkeeping bug with acquiring unknown resource. #4945

Merged

ls-daniel pushed a commit to longshotsyndicate/ray that referenced this pull request Jul 17, 2019

Autoscaler hotfix for ray-project#4555.

03c41c5

Edilmo added a commit to BonsaiAI/ray that referenced this pull request Feb 10, 2020

Cherry picking changes from PR 4555

16e6881

This PR contains changes that help with memory issues ray-project#4555
